Dataset Introduction
This dataset (https://www.kaggle.com/datasets/jacksondivakarr/car-crash-dataset?select=new+dataset.xlsx), which I found on Kaggle, records car crashes in Monroe County, Indiana, from 2003 to 2015. Each record has the year, month, day, a weekday/weekend flag, hour, collision type, injury type, primary factor behind the accident, reported location, and the latitude and longitude of the accident.
Imported Tools
The tools I imported were pandas for data manipulation, numpy for computations, matplotlib for visualizations such as bar charts and line charts, seaborn for heatmap-based visualizations such as the correlation table, and plotly express for the interactive graph used in the map. I also imported several scikit-learn utilities just in case; only LinearRegression ended up being used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
file_path = "C:/Users/19258/Downloads/new dataset.xlsx"
data = pd.read_excel(file_path)
display(data.head())
data.info()
| | Year | Month | Day | Weekend? | Hour | Collision Type | Injury Type | Primary Factor | Reported_Location | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 1 | 5 | Weekday | 0.0 | 2-Car | No injury/unknown | OTHER (DRIVER) - EXPLAIN IN NARRATIVE | 1ST & FESS | 39.159207 | -86.525874 |
| 1 | 2015 | 1 | 6 | Weekday | 1500.0 | 2-Car | No injury/unknown | FOLLOWING TOO CLOSELY | 2ND & COLLEGE | 39.161440 | -86.534848 |
| 2 | 2015 | 1 | 6 | Weekend | 2300.0 | 2-Car | Non-incapacitating | DISREGARD SIGNAL/REG SIGN | BASSWOOD & BLOOMFIELD | 39.149780 | -86.568890 |
| 3 | 2015 | 1 | 7 | Weekend | 900.0 | 2-Car | Non-incapacitating | FAILURE TO YIELD RIGHT OF WAY | GATES & JACOBS | 39.165655 | -86.575956 |
| 4 | 2015 | 1 | 7 | Weekend | 1100.0 | 2-Car | No injury/unknown | FAILURE TO YIELD RIGHT OF WAY | W 3RD | 39.164848 | -86.579625 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53943 entries, 0 to 53942
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Year               53943 non-null  int64
 1   Month              53943 non-null  int64
 2   Day                53943 non-null  int64
 3   Weekend?           53875 non-null  object
 4   Hour               53718 non-null  float64
 5   Collision Type     53937 non-null  object
 6   Injury Type        53943 non-null  object
 7   Primary Factor     52822 non-null  object
 8   Reported_Location  53908 non-null  object
 9   Latitude           53913 non-null  float64
 10  Longitude          53913 non-null  float64
dtypes: float64(3), int64(3), object(5)
memory usage: 4.5+ MB
Question
The overarching question I had was whether any of these factors are related to car accidents, and which factors play the largest role. I also wanted to see how the different aspects of car accidents relate to one another.
accidents_per_year = data.groupby("Year").size().reset_index(name="Total Accidents")
X = accidents_per_year["Year"].values.reshape(-1, 1)
y = accidents_per_year["Total Accidents"].values
model = LinearRegression()
model.fit(X, y)
plt.scatter(X, y, label="Data")
plt.plot(X, model.predict(X), label="Trend Line")
plt.xlabel("Year")
plt.ylabel("Total Accidents")
plt.title("Trend of Yearly Accidents")
plt.legend()
plt.show()
print("Intercept:", model.intercept_)
print("Slope:", model.coef_[0])
Intercept: 23168.73076923077
Slope: -9.467032967032967
Trend of Yearly Accidents Analysis
This graph shows the number of accidents per year together with a linear regression trend line. The fitted slope is about -9.47, that is, roughly 9.5 fewer accidents each year. However, that is mainly due to the extreme outlier year of 2003. Outside of that, the yearly totals are generally within 500 accidents of each other, with essentially random variation from year to year.
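To see how much a single unusual year can pull a least-squares trend line, the sketch below fits a slope with and without the first year. The yearly counts are made-up illustrative values, not the Monroe County figures, and `ols_slope` is a hypothetical helper written in plain Python rather than the `LinearRegression` model used above.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical yearly totals: one high first year, then a roughly flat series.
years = list(range(2003, 2016))
counts = [5000, 4100, 4150, 4050, 4120, 4080, 4110,
          4060, 4140, 4090, 4130, 4070, 4100]

slope_all = ols_slope(years, counts)
slope_without_first = ols_slope(years[1:], counts[1:])

print(f"Slope with 2003:    {slope_all:.2f}")
print(f"Slope without 2003: {slope_without_first:.2f}")
```

On the real data, the same comparison could be made by refitting the model on `accidents_per_year[accidents_per_year["Year"] > 2003]`.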
accidents_per_day = data["Day"].value_counts().sort_index()
accidents_per_day.plot(kind="bar", color="skyblue", edgecolor="black")
plt.title("Number of Accidents per Day", fontsize=16)
plt.xlabel("Day of the Week", fontsize=14)
plt.ylabel("Number of Accidents", fontsize=14)
plt.xticks(rotation=0)
plt.show()
Number of Accidents per Day Analysis
This graph shows the total number of accidents for each day of the week across 2003-2015. One important thing to note about the dataset is that day 1 is Monday. The day with the most accidents is Saturday and the day with the fewest is Monday; all the other days of the week are fairly similar in number of accidents.
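The day numbering can be checked against the `Weekend?` column. A minimal sketch, using only the `Day` and `Weekend?` values of the five sample rows displayed earlier in place of the full dataset, cross-tabulates the two columns:

```python
import pandas as pd

# Day and Weekend? values copied from the five rows shown by data.head().
sample = pd.DataFrame({
    "Day": [5, 6, 6, 7, 7],
    "Weekend?": ["Weekday", "Weekday", "Weekend", "Weekend", "Weekend"],
})

# Rows are day numbers, columns are the Weekday/Weekend labels.
day_vs_weekend = pd.crosstab(sample["Day"], sample["Weekend?"])
print(day_vs_weekend)
```

Running `pd.crosstab(data["Day"], data["Weekend?"])` on the full dataset would show which day numbers the dataset itself labels as weekend days.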
total_accidents_per_day = data.groupby("Day").size()
average_accidents_per_day = total_accidents_per_day / data["Year"].nunique()
for day, avg_accidents in average_accidents_per_day.items():
    print(f"Day: {day}, Average Accidents: {avg_accidents:.2f}")
Day: 1, Average Accidents: 407.00
Day: 2, Average Accidents: 574.85
Day: 3, Average Accidents: 625.62
Day: 4, Average Accidents: 606.15
Day: 5, Average Accidents: 624.77
Day: 6, Average Accidents: 744.62
Day: 7, Average Accidents: 566.46
average_accidents_1_to_5 = average_accidents_per_day.loc[1:5].mean()
average_accidents_6_to_7 = average_accidents_per_day.loc[6:7].mean()
print(f"Average Accidents (Weekdays): {average_accidents_1_to_5:.2f}")
print(f"Average Accidents (Weekends): {average_accidents_6_to_7:.2f}")
Average Accidents (Weekdays): 567.68
Average Accidents (Weekends): 655.54
Average Accidents Data
The output above turns the totals from the graph into per-year averages, dividing each day's total by the number of years in the dataset, to give another way of reading the graph. I then grouped these averages, treating days 1-5 as weekdays and days 6-7 as the weekend, to see whether the number of accidents differs between weekdays and weekends. A weekend day averages almost 90 more accidents per year than a weekday (655.54 versus 567.68), with the most accidents on Saturday and the fewest on Monday.
accidents_per_month = data["Month"].value_counts().sort_index()
accidents_per_month.plot(kind="bar", color="lightcoral", edgecolor="black")
plt.title("Number of Accidents per Month", fontsize=16)
plt.xlabel("Month", fontsize=14)
plt.ylabel("Number of Accidents", fontsize=14)
plt.xticks(ticks=range(12), rotation=0)
plt.show()
Month/Season Accident Analysis
This bar graph shows the total number of accidents per month from 2003-2015. The months with the fewest accidents are March, June, and July. There is a rise starting in August, with a peak in October. In general, the fall months account for more accidents than the other seasons, most likely because of poorer weather in the fall.
accidents_by_hour = data["Hour"].value_counts().sort_index()
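# Plot against the actual hour (Hour is stored as 0, 100, ..., 2300) so a missing hour cannot shift the line
plt.plot(accidents_by_hour.index / 100, accidents_by_hour.values, marker='o', color='steelblue')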
plt.title("Accidents by Time of Day", fontsize=16)
plt.xlabel("Hour of the Day", fontsize=14)
plt.ylabel("Number of Accidents", fontsize=14)
plt.xticks(ticks=range(0, 24), labels=[str(i) for i in range(24)], rotation=0)
plt.show()
Analysis of the Relation Between Time of Day and Accident Frequency
This line graph shows the total number of accidents by hour of the day from 2003-2015, in 24-hour time. The fewest accidents happen at 4 AM. From 6 AM to 8 AM there is a sharp increase, and another from 10 AM to 12 PM. After a slight decline at 13:00 (1 PM), accidents climb steeply to a peak at 17:00 (5 PM) and decline from then on. This pattern lines up closely with working hours and rush hour. It is interesting to note that the evening rush is associated with many more accidents than the morning rush.
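The `Hour` column stores times as 24-hour values such as 0.0, 1500.0, and 2300.0 (see the sample rows), so the axis labels 0 to 23 correspond to those values divided by 100. A minimal sketch of that conversion, on a few hypothetical values rather than the real column:

```python
import pandas as pd

# Hypothetical Hour values in the dataset's float encoding, with one missing entry.
hours_raw = pd.Series([0.0, 900.0, 1500.0, 1700.0, 1700.0, 2300.0, None])

# Drop missing values, then convert 1700.0 -> 17, 900.0 -> 9, and so on.
hour_of_day = (hours_raw.dropna() // 100).astype(int)

# Count records per hour; hours with no records get a count of 0.
counts_by_hour = hour_of_day.value_counts().reindex(range(24), fill_value=0)
print(counts_by_hour[counts_by_hour > 0])
```

Reindexing to all 24 hours keeps the x-axis complete even if some hour never appears in the data.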
collision_type_counts = data["Collision Type"].value_counts()
plt.figure(figsize=(10, 10))
collision_type_counts.plot.pie(autopct="%1.1f%%", startangle=0)
plt.title("Collision Type")
plt.show()
Most Common Accident Types
This pie chart shows the share of each collision type from 2003-2015. 2-car accidents are by far the most common at 67.9 percent, followed by 1-car at 19 percent and 3-car at 5.8 percent. Other collision types include 3+ cars, moped, bus, pedestrian, and motorcycle.
locations = data[['Latitude', 'Longitude', 'Reported_Location', "Hour", "Collision Type", "Injury Type", "Primary Factor"]].dropna()
fig = px.scatter_mapbox(
    locations,
    lat="Latitude",
    lon="Longitude",
    hover_name="Reported_Location",
    hover_data={"Hour": True, "Injury Type": True, "Primary Factor": True, "Latitude": False, "Longitude": False},
    color="Collision Type",
    title="Interactive Map of Locations",
    mapbox_style="open-street-map",
    zoom=8,
)
fig.update_layout(
    mapbox=dict(center={"lat": 39.25, "lon": -86.45}),
    title="Accidents in Monroe County",
    title_x=0.5,
)
fig.show()
Accident Visualization
This map plots each accident from 2003-2015 at its recorded latitude and longitude; rows with missing values are dropped first. Each point is color-coded by collision type, and hovering over it shows the injury type, primary factor, and the hour at which the accident happened. The data is centered on Bloomington, Indiana, and most of the accidents happen along the major roads in the heart of the city rather than in suburban areas.
collision_injury_data = data[['Collision Type', 'Injury Type']]
correlation = pd.crosstab(collision_injury_data['Collision Type'], collision_injury_data['Injury Type'])
sns.heatmap(correlation, annot=True, fmt="d", cmap="YlGnBu")
plt.title('Correlation Between Collision Type and Injury Type')
plt.xlabel('Injury Type')
plt.ylabel('Collision Type')
plt.show()
Relation Between Collision Type and Injury Type
This heatmap shows the number of accidents for each combination of collision type and injury type, to give an understanding of how the two relate. The most common outcome is no injury across all collision types. It is interesting to note that motorcycles, which make up 1.9% of accidents, have the highest fatality rate at 2.1%, followed by pedestrians at 1.31%. Because they are so frequent, 2-car collisions have an extremely low fatality rate at 0.08%; the lowest is cyclists at 0%.
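The heatmap shows raw counts, while the percentages quoted above appear to be row shares, that is, the fraction of each collision type's accidents that ended in a given injury type. One way to get such shares is `pd.crosstab` with `normalize="index"`. The sketch below uses made-up records, not the real data, and the label `"Fatal"` is an assumption about how the dataset names that category.

```python
import pandas as pd

# Made-up records for illustration only, not the Monroe County data.
records = pd.DataFrame({
    "Collision Type": ["2-Car"] * 6 + ["Motorcycle"] * 4,
    "Injury Type": (["No injury/unknown"] * 5 + ["Non-incapacitating"]
                    + ["No injury/unknown", "Non-incapacitating",
                       "Non-incapacitating", "Fatal"]),
})

# normalize="index" divides each row by its total, so every row sums to 1.
shares = pd.crosstab(records["Collision Type"], records["Injury Type"],
                     normalize="index")
print((shares * 100).round(1))
```

Row shares make rare collision types comparable with common ones, which raw counts do not.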
primary_factor_counts = data['Primary Factor'].value_counts()
primary_factor_table = pd.DataFrame({
    'Primary Factor': primary_factor_counts.index,
    'Number of Accidents': primary_factor_counts.values
})
print(primary_factor_table)
    Primary Factor  Number of Accidents
0  FAILURE TO YIELD RIGHT OF WAY  11193
1  FOLLOWING TOO CLOSELY  7359
2  OTHER (DRIVER) - EXPLAIN IN NARRATIVE  6158
3  UNSAFE BACKING  5188
4  RAN OFF ROAD RIGHT  2925
5  DISREGARD SIGNAL/REG SIGN  2206
6  SPEED TOO FAST FOR WEATHER CONDITIONS  1921
7  IMPROPER TURNING  1843
8  ANIMAL/OBJECT IN ROADWAY  1688
9  DRIVER DISTRACTED - EXPLAIN IN NARRATIVE  1656
10  UNSAFE SPEED  1499
11  ROADWAY SURFACE CONDITION  1270
12  LEFT OF CENTER  1078
13  IMPROPER LANE USAGE  985
14  ALCOHOLIC BEVERAGES  805
15  UNSAFE LANE MOVEMENT  756
16  OVERCORRECTING/OVERSTEERING  597
17  IMPROPER PASSING  496
18  OTHER (VEHICLE) - EXPLAIN IN NARRATIVE  472
19  OTHER (ENVIRONMENTAL) - EXPLAIN IN NARR  418
20  BRAKE FAILURE OR DEFECTIVE  361
21  PEDESTRIAN ACTION  292
22  DRIVER ASLEEP OR FATIGUED  267
23  DRIVER ILLNESS  182
24  VIEW OBSTRUCTED  175
25  CELL PHONE USAGE  141
26  NONE (DRIVER)  116
27  WRONG WAY ON ONE WAY  103
28  TIRE FAILURE OR DEFECTIVE  84
29  RAN OFF ROAD LEFT  60
30  PRESCRIPTION DRUGS  58
31  GLARE  53
32  ACCELERATOR FAILURE OR DEFECTIVE  50
33  INSECURE/LEAKY LOAD  46
34  OBSTRUCTION NOT MARKED  37
35  STEERING FAILURE  31
36  PASSENGER DISTRACTION  31
37  ILLEGAL DRUGS  29
38  OTHER TELEMATICS IN USE  28
39  OVERSIZE/OVERWEIGHT LOAD  26
40  ENGINE FAILURE OR DEFECTIVE  25
41  HEADLIGHT DEFECTIVE OR NOT ON  20
42  HOLES/RUTS IN SURFACE  15
43  TRAFFIC CONTROL INOPERATIVE/MISSING/OBSC  12
44  NONE (ENVIRONMENTAL)  12
45  NONE (VEHICLE)  11
46  OTHER LIGHTS DEFECTIVE  10
47  TOW HITCH FAILURE  8
48  ROAD UNDER CONSTRUCTION  7
49  JACKKNIFING  6
50  SEVERE CROSSWINDS  4
51  LANE MARKING OBSCURED  3
52  VIOLATION OF LICENSE RESTRICTION  3
53  SHOULDER DEFECTIVE  2
54  UTILITY WORK  1
Most Common Reason for Accidents
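This table lists the primary factors behind accidents across 2003-2015, from most to least common. Failure to yield right of way is the most common by a large margin (11,193 accidents, against 7,359 for following too closely), and utility work is the least common, with a single accident.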
top_20_locations = data['Reported_Location'].value_counts().head(20)
top_20_locations_df = top_20_locations.reset_index()
top_20_locations_df.columns = ['Reported_Location', 'Count']
print(top_20_locations_df)
    Reported_Location  Count
0  E 3RD ST  375
1  W 3RD ST  222
2  SR37N & VERNAL  197
3  3RD ST  195
4  S WALNUT ST  172
5  E 10TH ST  153
6  N WALNUT ST  124
7  S COLLEGE MALL RD  123
8  SR37 & VERNAL  123
9  WALNUT ST  117
10  SR37S & VERNAL  113
11  SR37S & TAPP  112
12  3RD ST & JORDAN  108
13  10TH & COLLEGE AVE  105
14  EAST 3RD ST  104
15  E 17TH ST  102
16  SR37 & SR45  98
17  3RD ST & COLLEGE MALL  96
18  N FEE LN  94
19  13TH & INDIANA AVE  87
Most Common Accident Locations
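This shows the top 20 reported locations and the number of accidents that occurred at each. E 3rd Street has the most, followed by W 3rd Street; the lowest of the top 20 is 13th and Indiana Avenue. Some streets appear under more than one spelling (for example E 3RD ST and EAST 3RD ST), so their counts are split across rows.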
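Spelling variants such as `E 3RD ST` and `EAST 3RD ST` could be merged before counting. A minimal sketch, assuming that a leading spelled-out direction means the same as its one-letter abbreviation (an assumption about the data, not something the dataset documents); `direction_map` is a hypothetical name:

```python
import pandas as pd

# A few location strings of the kind seen in the top-20 output.
locations = pd.Series(["E 3RD ST", "EAST 3RD ST", "W 3RD ST", "E 3RD ST"])

# Assumption: a leading EAST/WEST/NORTH/SOUTH is equivalent to E/W/N/S.
direction_map = {"EAST": "E", "WEST": "W", "NORTH": "N", "SOUTH": "S"}
normalized = locations.str.replace(
    r"^(EAST|WEST|NORTH|SOUTH)\b",
    lambda m: direction_map[m.group(1)],
    regex=True,
)

print(normalized.value_counts())
```

Other variants, such as a street recorded with no direction at all, cannot be merged safely this way and would need a manual check.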
categorical_columns = ["Collision Type", "Injury Type", "Primary Factor", "Reported_Location"]
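# factorize numbers categories in order of first appearance (missing values become -1); the codes are labels, not a scale
data[categorical_columns] = data[categorical_columns].apply(lambda col: pd.factorize(col)[0])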
numeric_columns = ["Year", "Month", "Day", "Hour", "Collision Type", "Injury Type", "Primary Factor", "Reported_Location"]
updated_correlation_matrix = data[numeric_columns].corr()
sns.heatmap(updated_correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True, square=True)
plt.title("Correlation Matrix", fontsize=16)
plt.show()
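One caution when reading the correlation matrix: `pd.factorize` numbers categories in order of first appearance, so the integer codes for the text columns carry no natural order. The sketch below, on made-up rows with a hypothetical helper `code_correlation`, shows the Pearson correlation of such codes changing sign when the same rows are merely reordered:

```python
import pandas as pd

# Made-up categorical data for illustration only.
df = pd.DataFrame({
    "Collision Type": ["1-Car", "2-Car", "2-Car", "Bus", "1-Car", "Bus"],
    "Injury Type": ["No injury", "No injury", "Injury",
                    "Injury", "No injury", "Injury"],
})

def code_correlation(frame):
    """Pearson correlation between the factorized codes of the two columns."""
    x = pd.Series(pd.factorize(frame["Collision Type"])[0])
    y = pd.Series(pd.factorize(frame["Injury Type"])[0])
    return x.corr(y)

r_original = code_correlation(df)
# Same six rows, with the third row moved to the front.
r_reordered = code_correlation(df.iloc[[2, 0, 1, 3, 4, 5]])

print(f"Original order:  {r_original:.2f}")
print(f"Reordered rows:  {r_reordered:.2f}")
```

For that reason, small values between the text columns in the matrix are better read as a rough indication than as a measure of association.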
Final Takeaways
In summary, several factors relate to increased accidents: location, time of day, month, and the day of the week. However, the relations between the factors themselves are not strong, apart from collision type and injury type, and possibly primary factor and collision type. My main takeaway is that the relationships behind car accidents cannot necessarily be read off a correlation matrix; they call for a more nuanced look at the data. Accidents come down overwhelmingly to humans, and each person is different. The human aspect plays the single most important role: how drivers change the way they drive in certain seasons and at rush hour, a lack of understanding of road signs, not paying attention to the road, and so on. The most practical thing we can do to reduce accidents is to follow the rules and be a smart, defensive driver.